In the first plot, linolenic is a continuous variable mapped to shades of blue. The channel capacity of hue is only about 10 levels, and occlusion among the points further hampers our reading of the plot. In the second plot, linolenic is segmented into 4 groups, which gives a quicker understanding of the data by showing the relative values of the groups. Relative judgement still suffers, however, because color hue comes with the highest perceptual error for human viewers.
ol = read.csv("olive.csv", header = T, row.names = 1)
head(ol, 10)
##    Region         Area palmitic palmitoleic stearic oleic linoleic
## 1       1 North-Apulia     1075          75     226  7823      672
## 2       1 North-Apulia     1088          73     224  7709      781
## 3       1 North-Apulia      911          54     246  8113      549
## 4       1 North-Apulia      966          57     240  7952      619
## 5       1 North-Apulia     1051          67     259  7771      672
## 6       1 North-Apulia      911          49     268  7924      678
## 7       1 North-Apulia      922          66     264  7990      618
## 8       1 North-Apulia     1100          61     235  7728      734
## 9       1 North-Apulia     1082          60     239  7745      709
## 10      1 North-Apulia     1037          55     213  7944      633
##    linolenic arachidic eicosenoic
## 1         36        60         29
## 2         31        61         29
## 3         31        63         29
## 4         50        78         35
## 5         50        80         46
## 6         51        70         44
## 7         49        56         29
## 8         39        64         35
## 9         46        83         33
## 10        26        52         30
ggplot(ol, aes(palmitic, oleic, col = linolenic)) + geom_point()
disc = cut_interval(ol$linolenic, 4)
ggplot(ol, aes(palmitic, oleic, col = disc)) + geom_point()
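To make the discretisation step concrete, here is a minimal sketch of what `cut_interval` does. It uses a small hand-typed vector `x` (an assumption for illustration, not the full olive data) and splits its range into 4 bins of equal width, so the number of observations per bin can differ.

```r
library(ggplot2)

# Illustrative values standing in for ol$linolenic (assumption)
x <- c(26, 31, 31, 36, 39, 46, 49, 50, 50, 51)

# Four bins of equal width over range(x)
groups <- cut_interval(x, 4)

# Number of bins and number of observations per bin
nlevels(groups)
table(groups)
```

Equal-width bins keep the legend easy to read; if equally sized groups were preferred, `cut_number` would be the alternative in ggplot2.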
The plot that maps the 4 linolenic groups to color (the 2nd plot of the previous question) is the easiest to analyse. The size mapping creates a problem of occlusion because the points overlap. The orientation angle mapping does not help either, as a scatter plot with many observations leads to a high relative judgement error.
a) With color hue, 10 levels of a feature can be perceived and 3.1 bits can be decoded; with color brightness, 5 levels and 2.1 bits.
b) With object size, 4-5 levels of a feature can be perceived, depending on the individual abilities of the human subject, and 2.2 bits can be decoded for this aesthetic.
c) With line orientation, 3 bits can be decoded.
ggplot(ol, aes(palmitic, oleic, col = disc)) + geom_point()
ggplot(ol, aes(palmitic, oleic, size = disc)) + geom_point()
## Warning: Using size for a discrete variable is not advised.
ggplot(ol, aes(palmitic, oleic)) + geom_point() +
geom_spoke(angle = ol$linolenic, radius = 40)
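As far as I understand, `geom_spoke` reads `angle` in radians, so raw linolenic values (26 to 51) wrap around the circle several times and nearby values can end up pointing in unrelated directions. One possible adjustment, sketched below, rescales the variable before mapping it to orientation. The data frame `toy` holds the first five rows of the output above, and mapping onto the interval 0 to pi is my own assumption, not part of the assignment.

```r
library(ggplot2)

# First five rows of the olive data, typed in by hand
toy <- data.frame(
  palmitic  = c(1075, 1088, 911, 966, 1051),
  oleic     = c(7823, 7709, 8113, 7952, 7771),
  linolenic = c(36, 31, 31, 50, 50)
)

# Linearly rescale linolenic to [0, pi] so the ordering of values
# is preserved in the ordering of orientations
rng <- range(toy$linolenic)
toy$angle <- (toy$linolenic - rng[1]) / diff(rng) * pi

p <- ggplot(toy, aes(palmitic, oleic)) +
  geom_point() +
  geom_spoke(aes(angle = angle), radius = 40)
p
```

Even with this rescaling, the relative judgement problem described above remains when many spokes overlap.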
Treisman’s theory of preattentive processing is showcased in this example. When Region is not segmented, we do not easily identify the boundaries between the groups, but in the second plot we see them much more quickly thanks to preattentive processing of contrast and luminance.
ggplot(ol, aes(oleic, eicosenoic, col = Region)) + geom_point()
ggplot(ol, aes(oleic, eicosenoic, col = cut_interval(Region,3))) + geom_point()
Each of the 3 colors is combined with shape and size, and these feature maps are processed in parallel in our brain. With 3 x 3 x 3 = 27 different types of observation, preattentive processing breaks down and we tend to group the data wrongly in our perception.
ggplot(ol, aes(ol$oleic, ol$eicosenoic, col = cut_interval(ol$linoleic, 3),
shape = cut_interval(ol$palmitic, 3),
size = cut_interval(ol$palmitoleic, 3))) + geom_point()
## Warning: Using size for a discrete variable is not advised.
Size, contrast and shape are individual feature maps, each linked to different colors, and hence preattentive processing helps in this case.
ggplot(ol, aes(ol$oleic, ol$eicosenoic, col = ol$Region,
shape = cut_interval(ol$palmitic, 3),
size = cut_interval(ol$palmitoleic, 3))) + geom_point()
## Warning: Using size for a discrete variable is not advised.
The relative judgement error due to area is very high because the data are drawn as a pie chart: the dominant group, South-Apulia, looks much larger than the other groups.
p <- plot_ly(ol, labels = ~Area, type = 'pie', showlegend = FALSE) %>%
layout(title = 'Pie Chart Area')
p
It is harder to look for outliers in the contour plot than in the scatter plot. The contour plot shows 5 peaks, but you won't be able to spot them in the scatter plot. The extreme values are not drawn in the contour plot. It is also harder to identify clusters in the contour plot than in the scatter plot, which is a big issue with this plot.
ggplot(ol, aes(ol$linoleic, ol$eicosenoic)) + geom_density2d()
ggplot(ol, aes(ol$linoleic, ol$eicosenoic)) + geom_point()
The columns vary a lot in range. Some, like BAvg, lie between 0.235 and 0.282, while others, like TB, lie between 2090 and 2615. However, when we scale the values and apply MDS to the scaled data, the stress increases, that is, the goodness of fit decreases. This is why I think we should not scale the values before applying MDS.
bball = read.xlsx("baseball-2016.xlsx", sheetName = "Sheet1", header = TRUE,
row.names = 1)
head(bball)
##                     League Won Lost Runs.per.game HR.per.game   AB Runs
## Aizona Diamondbacks     NL  69   93          4.64    1.172840 5665  752
## Atlanta Braves          NL  68   93          4.03    0.757764 5514  649
## Baltimore Orioles       AL  89   73          4.59    1.561728 5524  744
## Boston Red Sox          AL  93   69          5.42    1.283951 5670  878
## Chicago Cubs            NL 103   58          4.99    1.236025 5503  808
## Chicago White Sox       AL  78   84          4.23    1.037037 5550  686
##                     Hits X2B X3B  HR RBI StolenB CaughtS  BB   SO  BAvg
## Aizona Diamondbacks 1479 285  56 190 709     137      31 463 1427 0.261
## Atlanta Braves      1404 295  27 122 615      75      34 502 1240 0.255
## Baltimore Orioles   1413 265   6 253 710      19      13 468 1324 0.256
## Boston Red Sox      1598 343  25 208 836      83      24 558 1160 0.282
## Chicago Cubs        1409 293  30 199 767      66      34 656 1339 0.256
## Chicago White Sox   1428 277  33 168 656      77      36 455 1285 0.257
##                       OBP   SLG   OPS   TB GDP HBP SH SF IBB  LOB
## Aizona Diamondbacks 0.320 0.432 0.752 2446 117  50 43 38  43 1113
## Atlanta Braves      0.321 0.384 0.705 2119 145  59 64 52  60 1161
## Baltimore Orioles   0.317 0.443 0.760 2449 119  44 17 36  19 1065
## Boston Red Sox      0.348 0.461 0.810 2615 137  43  8 40  34 1162
## Chicago Cubs        0.343 0.429 0.772 2359 107  96 42 37  45 1217
## Chicago White Sox   0.317 0.410 0.727 2275 122  53 29 44  16 1105
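The effect of the differing column ranges on the distances can be illustrated with a small sketch. The data frame `toy` below is randomly generated within the BAvg and TB ranges quoted above (it is not the baseball data), and the comparison by correlation is just one possible way to look at it.

```r
# Two made-up columns on very different scales (assumption, not real data)
set.seed(1)
toy <- data.frame(
  BAvg = runif(10, 0.235, 0.282),
  TB   = runif(10, 2090, 2615)
)

d_raw    <- dist(toy)          # Euclidean distances on the raw values
d_tb     <- dist(toy["TB"])    # distances using TB alone
d_scaled <- dist(scale(toy))   # distances after centring and scaling each column

# How closely each distance matrix follows the TB-only distances
cor_raw    <- cor(as.numeric(d_raw), as.numeric(d_tb))
cor_scaled <- cor(as.numeric(d_scaled), as.numeric(d_tb))
c(raw = cor_raw, scaled = cor_scaled)
```

On raw values the distances are driven almost entirely by the wide-range column, whereas after `scale()` both columns contribute. Whether that is desirable depends on which statistics should dominate the MDS map.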
It is hard to see a difference between the leagues in this plot. The points are well spread out, so it is hard to tell whether one MDS component differentiates the leagues better than the other. I felt V1 split the leagues better than V2. According to this plot, “Los Angeles Angels”, “Boston Red Sox” and “Colorado Rockies” look like outliers.
bball.numeric = bball[,3:27]
distance = dist(bball.numeric)
res = isoMDS(distance, k=2, p=2)
## initial value 12.033362
## final value 12.032500
## converged
coords = res$points
coordsMDS = as.data.frame(coords)
coordsMDS$name = rownames(coordsMDS)
coordsMDS$league = bball$League
plot_ly(coordsMDS, x=~V1, y=~V2, type="scatter", mode = "markers"
, hovertext=~name, color= ~league)
MDS was able to decrease the stress value to about 12%. Given that the dataset had 26 dimensions and was reduced to 2, a stress level of about 12% is good. Some of the observation pairs were hard for MDS to map, and they almost always included the Chicago Cubs. Some of the pairs that were hard to map are Chicago Cubs and Arizona Diamondbacks, Chicago Cubs and Baltimore Orioles, and Chicago Cubs and Kansas City Royals.
sh <- Shepard(distance, coords)
delta <-as.numeric(distance)
D<- as.numeric(dist(coords))
n=nrow(coords)
index=matrix(1:n, nrow=n, ncol=n)
index1=as.numeric(index[lower.tri(index)])
n=nrow(coords)
index=matrix(1:n, nrow=n, ncol=n, byrow = T)
index2=as.numeric(index[lower.tri(index)])
plot_ly()%>%
add_markers(x=~delta, y=~D, hoverinfo = 'text',
text = ~paste('Obj1: ', rownames(bball)[index1],
'<br> Obj 2: ', rownames(bball)[index2]))%>%
add_lines(x=~sh$x, y=~sh$yf)
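The stress value and the Shepard plot are closely related. As a minimal sketch, assuming `isoMDS` reports Kruskal's stress in percent (my reading of the MASS documentation), the stress can be recomputed from the `Shepard` output. The matrix `toy` is randomly generated and only stands in for the baseball data.

```r
library(MASS)

# Made-up data: 15 objects described by 6 numeric variables
set.seed(1)
toy <- matrix(rnorm(15 * 6), nrow = 15)

d   <- dist(toy)
fit <- isoMDS(d, k = 2, trace = FALSE)
sh  <- Shepard(d, fit$points)

# sh$x:  original dissimilarities
# sh$y:  distances in the 2-D configuration
# sh$yf: monotone (isotonic) fit of those distances to the dissimilarities
# Kruskal stress in percent: misfit relative to the configuration distances
stress_manual <- 100 * sqrt(sum((sh$y - sh$yf)^2) / sum(sh$y^2))

c(reported = fit$stress, manual = stress_manual)
```

If the assumption holds, the two numbers should be close. Points far from the fitted line in the Shepard plot are the pairs that contribute most to the numerator.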
Since V1 was splitting the leagues better, I plotted all the variables against it and found that TB (Total Bases) and OPS (On-base Plus Slugging) had the strongest positive relationship with it. I have included the plots for these two. Both TB and OPS showed a very strong positive relationship with V1. These two variables are very important in scoring the baseball teams.
plots_bball$P20 #OPS **
plots_bball$P21 #TB **
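Instead of inspecting every plot by eye, the variables could also be ranked by their correlation with V1. The sketch below is only an illustration: `toy` is simulated, with TB and OPS constructed to depend on V1 and SH constructed to be unrelated, so the numbers say nothing about the real data.

```r
# Simulated stand-in for the MDS coordinate and three statistics (assumption)
set.seed(1)
V1  <- rnorm(30)
toy <- data.frame(
  V1  = V1,
  TB  = 2300 + 150 * V1 + rnorm(30, sd = 20),    # built to follow V1
  OPS = 0.74 + 0.03 * V1 + rnorm(30, sd = 0.01), # built to follow V1
  SH  = rnorm(30, mean = 35, sd = 15)            # built to be unrelated
)

# Correlation of every other column with V1, strongest first
others <- setdiff(names(toy), "V1")
cors   <- sapply(toy[others], function(col) cor(toy$V1, col))
sort(abs(cors), decreasing = TRUE)
```

Applied to the real data, the same `sapply` call over the numeric columns of `bball` would give a ranking to compare against the plots.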
library(ggplot2)
library(plotly)
library(xlsx)
library(MASS)
library(gridExtra)
#Assignment 1
#Q1
ol = read.csv("olive.csv", header = T, row.names = 1)
head(ol, 10)
ggplot(ol, aes(palmitic, oleic, col = linolenic)) + geom_point()
disc = cut_interval(ol$linolenic, 4)
ggplot(ol, aes(palmitic, oleic, col = disc)) + geom_point()
#Q2
ggplot(ol, aes(palmitic, oleic, col = disc)) + geom_point()
ggplot(ol, aes(palmitic, oleic, size = disc)) + geom_point()
ggplot(ol, aes(palmitic, oleic)) + geom_point() +
geom_spoke(angle = ol$linolenic, radius = 40)
#Q3
ggplot(ol, aes(oleic, eicosenoic, col = Region)) + geom_point()
#Q4
ggplot(ol, aes(ol$oleic, ol$eicosenoic, col = cut_interval(ol$linoleic, 3),
shape = cut_interval(ol$palmitic, 3),
size = cut_interval(ol$palmitoleic, 3))) + geom_point()
#Q5
ggplot(ol, aes(ol$oleic, ol$eicosenoic, col = ol$Region,
shape = cut_interval(ol$palmitic, 3),
size = cut_interval(ol$palmitoleic, 3))) + geom_point()
#Q6
p <- plot_ly(ol, labels = ~Area, type = 'pie', showlegend = FALSE) %>%
layout(title = 'Pie Chart Area')
p
#Q7
ggplot(ol, aes(ol$linoleic, ol$eicosenoic)) + geom_density2d()
ggplot(ol, aes(ol$linoleic, ol$eicosenoic)) + geom_point()
#Assignment 2
#Q1
bball = read.xlsx("baseball-2016.xlsx", sheetName = "Sheet1", header = TRUE,
row.names = 1)
head(bball)
#Q2
bball.numeric = bball[,3:27]
distance = dist(bball.numeric)
res = isoMDS(distance, k=2, p=2)
coords = res$points
coordsMDS = as.data.frame(coords)
coordsMDS$name = rownames(coordsMDS)
coordsMDS$league = bball$League
plot_ly(coordsMDS, x=~V1, y=~V2, type="scatter", mode = "markers"
, hovertext=~name, color= ~league)
#Q3
sh <- Shepard(distance, coords)
delta <-as.numeric(distance)
D<- as.numeric(dist(coords))
n=nrow(coords)
index=matrix(1:n, nrow=n, ncol=n)
index1=as.numeric(index[lower.tri(index)])
n=nrow(coords)
index=matrix(1:n, nrow=n, ncol=n, byrow = T)
index2=as.numeric(index[lower.tri(index)])
plot_ly()%>%
add_markers(x=~delta, y=~D, hoverinfo = 'text',
text = ~paste('Obj1: ', rownames(bball)[index1],
'<br> Obj 2: ', rownames(bball)[index2]))%>%
add_lines(x=~sh$x, y=~sh$yf)
#Q4
bball$V1 = coordsMDS$V1
bball$V2 = coordsMDS$V2
cols_bball = colnames(bball)
dim(bball)[2]
plots_bball = list()
for(i in 2:27){
pl_name = paste("P", i, sep = '')
col_name = cols_bball[i]
plots_bball[[pl_name]] = ggplot(bball, aes_string("V1", col_name)) +
geom_point() + geom_line(aes(x=0))
}
grid.arrange(grobs = plots_bball, ncol = 6, nrow = 6)
plots_bball$P2
plots_bball$P3
plots_bball$P4 #Runs per game
plots_bball$P5
plots_bball$P6 #AB
plots_bball$P7 #Runs **
plots_bball$P8 #Hits **
plots_bball$P9
plots_bball$P10
plots_bball$P11
plots_bball$P12 #RBI **
plots_bball$P13
plots_bball$P14
plots_bball$P15
plots_bball$P16
plots_bball$P17 #BAvg
plots_bball$P18
plots_bball$P19 #SLG
plots_bball$P20 #OPS **
plots_bball$P21 #TB **
plots_bball$P22
plots_bball$P23
plots_bball$P24
plots_bball$P25
plots_bball$P26
plots_bball$P27
plots_bball2 = list()
for(i in 2:27){
pl_name = paste("P", i, sep = '')
col_name = cols_bball[i]
plots_bball2[[pl_name]] = ggplot(bball, aes_string("V2", col_name)) +
geom_point() + geom_line(aes(x=0))
}
grid.arrange(grobs = plots_bball2, ncol = 6, nrow = 6)
plots_bball2$P2
plots_bball2$P3
plots_bball2$P4
plots_bball2$P5
plots_bball2$P6
plots_bball2$P7
plots_bball2$P8
plots_bball2$P9
plots_bball2$P10
plots_bball2$P11
plots_bball2$P12
plots_bball2$P13
plots_bball2$P14
plots_bball2$P15
plots_bball2$P16
plots_bball2$P17
plots_bball2$P18
plots_bball2$P19
plots_bball2$P20
plots_bball2$P21
plots_bball2$P22
plots_bball2$P23
plots_bball2$P24
plots_bball2$P25
plots_bball2$P26
plots_bball2$P27